
BMC Medical Research Methodology

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match BMC Medical Research Methodology's content profile, based on 43 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.

1
Simulation-Based Comparison of Controlled Interrupted Time Series (CITS) and Multivariable Regression

ORWA, F. O.; Mutai, C.; Nizeyimana, I.; Mwangi, A.

2026-04-13 health policy 10.64898/2026.04.10.26350670 medRxiv
Top 0.1%
22.7%

When randomized controlled trials are impractical, interrupted time series designs offer a rigorous quasi-experimental approach to assess population level policies. Indeed, in the context of quasi-experimental designs (QEDs), the Interrupted Time Series (ITS) method is commonly thought of as the most robust. But interrupted time series designs are susceptible to serial correlation and confounding by time-varying factors associated with both the intervention and the outcome, which may result in biased inference. Thus, we provide a simulation-based contrast of controlled interrupted time series (CITS) and multivariable regression (multivariable negative binomial regression) for estimation of policy effects in count time series data. These approaches are widely used in policy evaluations, yet their comparative performance in typical population health settings has rarely been examined directly. We tested both approaches within a variety of data generating situations, differing in the series length, intervention effect size, and magnitude of lag-1 autocorrelation. Bias, standard error calibration, confidence interval coverage, mean squared error, and statistical power were assessed for performance. Both methods gave unbiased estimates for moderate and large intervention effects, although bias was more pronounced for small effects, particularly in short series. Although the point estimate performance was similar, inferential properties varied significantly. CITS always had smaller mean squared error, better consistency between model based and empirical standard errors, and confidence interval coverage near the 95% nominal levels over weak to moderate autocorrelation. By contrast, multivariable regression was more sensitive to serial dependence, leading to underestimated standard errors and undercoverage, especially at moderate to high autocorrelation, regardless of Newey-West adjustments. 
These findings show the benefits of using a concurrent control series and the importance of structurally accounting for serial correlation when studying population level policies with time series data.
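
For orientation, a CITS model for count data is often written as a segmented log-linear regression with a group indicator. The form below is one common textbook parameterization, not necessarily the specification used in this preprint. The symbols $P_t$, $G_g$, $T_0$ and the coefficient labels are notation assumed here for illustration.

```latex
\log \mathbb{E}[Y_{gt}] =
  \beta_0 + \beta_1 t + \beta_2 P_t + \beta_3 (t - T_0) P_t
  + \beta_4 G_g + \beta_5 G_g t + \beta_6 G_g P_t + \beta_7 G_g (t - T_0) P_t,
\qquad
P_t = \mathbf{1}\{t \ge T_0\},\quad
G_g = \mathbf{1}\{g = \text{intervention series}\}
```

Under this parameterization, $\beta_6$ is the difference in level change and $\beta_7$ the difference in slope change between the intervention and control series, which is where the concurrent control enters the policy-effect estimate.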

2
Benchmarking foundation models for improving confounding control in target trial emulation

Kleper, S. L.; Melamed, R. D.

2026-05-13 epidemiology 10.64898/2026.05.09.26352820 medRxiv
Top 0.1%
19.0%

Machine learning models for causal inference aim to adjust for confounding factors that are associated with both an exposure and an outcome, creating a spurious biased association. But these methods are rarely empirically evaluated to assess their success in mitigating such bias. Recent advances in knowledge representation, including both foundation models and knowledge graphs, could enrich these models, but rigorous evaluations are needed in order to assess their potential. Here, we ask whether enriching existing causal inference models with knowledge representations from foundation models can improve confounding control. Rather than using semi-simulated data to address this question, we focus on examples of real confounding: we emulate target randomized active comparator trials that are subject to confounding by indication. Our results can guide researchers aiming to develop or apply methods for discovering causal effects from observational data.

3
Validation of an AI-Assisted Framework for Systematic Bias Assessment in Observational Studies

Etminan, M.; Rezaeianzadeh, R.; Douros, A.

2026-04-28 epidemiology 10.64898/2026.04.26.26351778 medRxiv
Top 0.1%
18.4%

Background: The rapid expansion of medical literature has led to substantial variability and frequent contradictions in study findings, making it increasingly difficult to distinguish meaningful signals from noise. Much of this variability arises from differences in study methodology, where biases such as confounding, selection bias, and reverse causation can drive spurious associations. While artificial intelligence (AI)-assisted tools have been developed to support risk-of-bias assessment, most are designed for systematic reviews and are not tailored to identifying specific epidemiologic biases in observational studies. This highlights the need for structured, scalable approaches to evaluate study validity in real-world evidence. Objective: To develop and validate an AI-assisted, expert-informed, rule-based framework (EpiVise) for systematically identifying and classifying key sources of bias in pharmacoepidemiologic studies, and to assess its agreement with expert evaluation. Methods: We conducted a validation study using recently published pharmacoepidemiologic studies from high-impact journals (post-2025). Each study was independently assessed by the framework and two expert epidemiologists across predefined bias domains, including measured confounding, confounding by indication, selection bias, immortal time bias, and disease latency. Agreement was evaluated using weighted kappa statistics. In the absence of a gold standard, expert judgment served as the reference benchmark. In a second phase, synthetic study scenarios with predefined embedded biases were constructed to assess the framework's ability to detect known bias structures under controlled conditions. 
Results: In analyses of published studies (10 studies; 60 ratings), agreement between the framework and expert assessments was substantial (κ = 0.75; 95% confidence interval [CI], 0.60-0.86), with 12 discordant ratings (20.0%), all limited to adjacent categories and occurring primarily in the confounding by indication and selection bias domains. In synthetic study scenarios (10 studies; 50 ratings), agreement was similarly substantial, with 42 of 50 ratings concordant (84%) and a weighted kappa of 0.77 (95% CI, 0.67-0.87); discordances included both adjacent-category and extreme disagreements and were concentrated in confounding by indication, selection bias, and prevalent user bias domains. Conclusions: This AI-assisted, expert-informed framework, EpiVise, provides a scalable and reproducible approach for evaluating epidemiologic study validity, demonstrating substantial agreement comparable to expert assessment. By systematically identifying key sources of bias, the framework has the potential to enhance the rigor and consistency of evidence evaluation, support peer review, and inform clinical, regulatory, and policy decision-making. Further validation across broader study designs and domains is warranted.
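
The agreement statistic reported above, weighted kappa, can be sketched in a few lines. This is a minimal illustration of Cohen's weighted kappa for two raters on ordinal categories, not the authors' analysis code. The function name, the integer coding of categories as 0..n_categories-1, and the choice between linear and quadratic weights are assumptions made here, since the abstract does not state which weighting was used.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, n_categories, scheme="linear"):
    """Cohen's weighted kappa for two raters on ordinal codes 0..n_categories-1.

    Hypothetical helper for illustration; `scheme` is "linear" or "quadratic".
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    if scheme not in ("linear", "quadratic"):
        raise ValueError("scheme must be 'linear' or 'quadratic'")

    n = len(rater_a)
    power = 1 if scheme == "linear" else 2

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, growing with category distance.
        return (abs(i - j) / (n_categories - 1)) ** power

    observed = Counter(zip(rater_a, rater_b))
    marg_a = Counter(rater_a)
    marg_b = Counter(rater_b)

    # Weighted disagreement actually observed between the two raters.
    obs = sum(weight(i, j) * c / n for (i, j), c in observed.items())
    # Weighted disagreement expected if the raters were independent.
    exp = sum(
        weight(i, j) * (marg_a[i] / n) * (marg_b[j] / n)
        for i in range(n_categories)
        for j in range(n_categories)
    )
    if exp == 0:
        raise ValueError("kappa is undefined when expected disagreement is zero")
    return 1.0 - obs / exp
```

Because disagreements are weighted by distance, adjacent-category discordances (as in the published-study phase) reduce kappa less than extreme ones (as in some synthetic scenarios).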

4
Causal analyses using education-health linked data for England: a case study

De Stavola, B. L. L.; Aparicio Castro, A.; Nguyen, V. G.; Lewis, K. M.; Dearden, L.; Harron, K.; Zylbersztejn, A.; Shumway, J.; Gilbert, R.

2026-03-19 health policy 10.64898/2026.03.13.26348340 medRxiv
Top 0.1%
18.2%

Introduction: This article summarises lessons learnt from the Health Outcomes for young People throughout Education (HOPE) Study and serves as a real-world, transferable application for addressing causal questions using administrative data. The HOPE study applied causal methods to analyses of administrative data in Education and Child Health Insights from Linked Data (ECHILD) aimed at studying the effectiveness of provision for special educational needs and disability (SEND) on health and education outcomes. Methods: Defining causal questions regarding the impact of SEND provision required judicious mapping of the question onto the data, leading to the selection of appropriate measures of effect, transparent handling of the data and control of confounding factors to estimate effects. We adopted the target trial emulation framework to guide these steps. Having encountered specific computational challenges in estimating the effects of interest, we simulated data that resembled the HOPE study and used them to practice the implementation of alternative estimation methods and to study the impact of some of their assumptions. Results: The creation and analysis of the simulated data provided valuable insights. First, we learned the importance of aligning the target of estimation with the causal question at hand. Second, we observed how deviations from assumptions specific to each estimation method can affect results. Third, we highlighted the benefits of employing alternative estimation methods as sensitivity tools that can aid the interpretation of the resulting estimates. Finally, we offer user-friendly code in two programming languages (R and Stata) and accompanying simulated data to facilitate the implementation of these methods for similar causal questions. 
Conclusion: We recommend that users of administrative data fully specify (and possibly revise) the causal questions they wish to address and carefully examine and compare the assumptions, implementation and results obtained using alternative estimation methods.

5
From Study Design to Executable Code: Automating Target Trial Emulation with Large Language Models

Kim, H.; Kim, M.; Kim, S.; You, S. C.

2026-03-14 health informatics 10.64898/2026.03.13.26348306 medRxiv
Top 0.1%
14.7%

Introduction: Implementing target trial emulation (TTE) study methods as end-to-end executable analytic code is technically demanding, and producing standardized, reproducible scripts consistently across research teams remains a persistent challenge. We aimed to develop a framework that translates free-text study descriptions into standardized analytic specifications and executable Strategus R scripts for the Observational Health Data Sciences and Informatics (OHDSI) ecosystem. Methods: We developed THESEUS (Text-guided Health-study Estimation and Specification Engine Using Strategus), which operates through two sequential steps. Large language models (LLMs) first map descriptions of the study into a constrained JavaScript Object Notation (JSON) schema (standardization step), after which the structured specifications are converted into R scripts with a self-auditing loop for error correction (code generation step). We evaluated eight proprietary LLMs using texts extracted from the methods section of 15 OHDSI-based TTE studies, and externally validated the framework on texts from 5 non-OHDSI studies, across three input settings: primary analysis text only, full analyses text, and full methods sections. Standardization was evaluated at the study level (whether all parameters in a study were correctly extracted) and at the field level (sensitivity and false positive rate per individual parameter), with field-level evaluation applied to the full analyses text and full methods sections input settings. Code generation was assessed by executability of the produced R scripts before and after self-auditing. Results: In the standardization step, study-level accuracy across models ranged from 0.91 to 0.98 for primary analysis, 0.67 to 0.87 for full analyses, and 0.67 to 0.85 for full methods sections in OHDSI studies, whereas the corresponding ranges were 0.73 to 0.93, 0.60 to 0.87, and 0.27 to 0.47 in non-OHDSI studies. 
At the field level, sensitivity across models under the full analyses text input setting ranged from 0.73 to 0.90 with 0.27 to 0.67 false positives per study in OHDSI studies, and from 0.71 to 0.90 with 0.20 to 1.00 false positives per study in non-OHDSI studies, depending on input setting. For code generation, first-run executability ranged from 0.80 to 1.00 for OHDSI studies and improved to 0.93 to 1.00 after self-auditing. In non-OHDSI studies, first-run executability ranged from 0.60 to 1.00, improving to 1.00 after self-auditing. Discussion: THESEUS demonstrates that pairing a standardized data model with a structured analysis framework enables reliable LLM-powered automation of the coding step in observational research. THESEUS supports the reliable translation of natural-language study descriptions into executable, shareable code in standardized observational research settings. This approach has the potential to lower the technical barriers to participation in observational research for a broader range of investigators.

6
Direct and mediated effects (DME) SLCMA: a novel method for life course modelling with time-varying covariates

Beer, S.; Simpkin, A. J.; Eldeeb, S. Y.; Zar, H. J.; Stein, D. J.; Dunn, E. C.; Smith, A. D. A. C.

2026-06-06 epidemiology 10.64898/2026.05.29.26354427 medRxiv
Top 0.1%
14.5%

Background: In prospective cohort studies, where an exposure is collected repeatedly, interest often lies in determining whether the timing of that exposure has a differential effect on a later outcome. The Structured Life Course Modeling Approach (SLCMA), where users select between temporal hypotheses of exposure specified a priori, provides one way to analyse such longitudinal data. However, few studies using SLCMA consider the effect of time-varying covariates (TVC) which may impact associations. Methods: We present a modified version of the SLCMA - called direct and mediated effects (DME)-SLCMA - which corrects for TVC. We first develop the DME-SLCMA method, test it through simulation, and apply it to psychosocial data from the Drakenstein Child Health Study (DCHS, n=336) to investigate relationships between maternal psychopathology, TVC of socioeconomic status, and offspring depressive symptoms. Results: We found that, on average, offspring depressive symptoms score increased by 3.9% (95% CI: 1.0%-6.9%, p = 0.039) for each unit of maternal psychopathology (SRQ) at 48 months whilst adjusting for time-varying socioeconomic status (at 18, 30, 42 and 54 months). Our simulations identified several realistic scenarios where selections ignoring TVC - with TVC mediated exposure effects present - were prone to be incorrect, including our DCHS example. Conclusion: DME-SLCMA is a robust new approach for life course modelling in the presence of time-varying covariates. We recommend adjusting for TVC whenever possible, and, when not possible, our simulation study identified that scenarios where mediated effects are comparable, or greater, in magnitude to direct effects are most prone to confounding.

7
Bias and Variance of Adjusting for Instruments

Hripcsak, G.; Anand, T.; Chen, H. Y.; Zhang, L.; Chen, Y.; Suchard, M. A.; Ryan, P. B.; Schuemie, M. J.

2026-03-15 epidemiology 10.64898/2026.03.13.26348328 medRxiv
Top 0.1%
12.2%

Propensity score adjustment is commonly used in observational research to address confounding. Controversy persists about how to select covariates as possible confounders to generate the propensity model. A desire to include all possible confounders is offset by a concern that more covariates will augment bias or increase variance. Much of the concern is over instruments, which are variables that affect the treatment but not the outcome. Adjusting for an instrument has been shown to increase bias due to unadjusted confounding and to increase the variance of the effect estimate. Large-scale propensity score (LSPS) adjustment includes most available pre-treatment covariates in its propensity model. It addresses instruments with a pair of diagnostics, ceasing the analysis if any covariate exceeds a correlation coefficient of 0.5 with the treatment and checking for an aggregation of instruments with equipoise reported as a preference score. Our simulation assesses the impact of adjusting for instruments in the context of LSPS's diagnostics. In our simulation, even when the variance of the treatment contributed by the adjusted instrument(s) exceeds an unadjusted confounder by over twenty-fold, when the correlation between the instrument(s) and the treatment was less than 0.5 and the equipoise was greater than 0.5, the additional shift in the effect estimate due to adjusting for the instrument(s) was less than the shift due to confounding by itself. Therefore, we find in this simulation that adjusting for instruments contributed a minor amount of bias to the effect estimate. This simulation aligns well with a previous assessment of the impact of adjusting for instruments and with separate empirical evidence that adjusting for many covariates surpasses attempts to identify a limited set of confounders.

8
Machine learning methodology using a masked neural network for robust genetic risk score calculation from noisy and missing data

Squires, S.; Weedon, M. N.; Oram, R. A.

2026-05-20 genetic and genomic medicine 10.64898/2026.05.18.25341725 medRxiv
Top 0.1%
10.1%

Purpose: Genetic risk scores (GRSs) are summaries of genetic data that can improve prediction of disease risk and progression. GRSs are increasingly available but rely on high-quality input data to produce good output results; with noisy or missing inputs the GRS may be inaccurate. We aimed to develop a method to produce a robust estimate of the GRS when input data is missing, noisy or both. Approach: We developed a neural network approach, named masked-MLP, for robust GRS calculation trained on a set of GRS scores calculated on clean data. The masked-MLP includes additional input data and has noise inserted during training, both of which make the model more robust. Results: A GRS for type 1 diabetes (T1D) calculated on input data with 10% of the data corrupted had a Spearman rank correlation to the clean GRS of 0.669 (0.665-0.674) while the equivalent for the masked-MLP was 0.951 (0.950-0.952). For the same data the area under the receiver operating characteristic curve for separation of T1D from population samples fell from 0.919 (0.904-0.932) to 0.808 (0.787-0.827) for the GRS while the masked-MLP fell to 0.910 (0.895-0.924). Conclusions: The masked-MLP was more robust to noise when calculating a GRS than using standard approaches. Our approach has the potential to ensure both improved research and clinical outcomes due to more reliable GRS calculation.

9
Can large language models approximate human perceptions of disease severity? An evaluation using Global Burden of Disease 2010 disability weights

Ha, Y.; Park, H.; Lee, Y.; Kim, S.; Ahn, S.

2026-05-04 health informatics 10.64898/2026.05.02.26352261 medRxiv
Top 0.1%
9.8%

Background: Disability weights (DWs) quantify the severity of health loss and are essential for estimating disability-adjusted life years in the Global Burden of Disease (GBD) framework. Conventional DW estimation relies on resource-intensive population surveys that are difficult to update or adapt to emerging health states. Large language models (LLMs) may offer a scalable alternative by approximating human perceptions of disease severity through structured judgment tasks. Methods: This exploratory study evaluated the alignment between LLM-derived and human-derived DW rankings using 222 health states from GBD 2010. All possible pairwise comparisons (24,531 pairs, each repeated three times) were conducted across four LLMs (GPT-5 mini, GPT-5, Claude Haiku 4.5, and Claude Sonnet 4.5). DWs were estimated via probit regression and evaluated using Spearman's rank correlation and Steiger's z test. The effects of prompt language (English vs. Korean), cultural role prompting, and medical specialist role prompting on alignment were examined. Additionally, the Binomial-Logit Indifference-Point (BLIP) estimator was proposed and validated through leave-one-out cross-validation for estimating DWs for health states without established values. Results: All four LLMs showed high rank correlation with GBD 2010 DWs (Spearman's ρ = 0.893 to 0.909), with no significant inter-model differences. Korean-language prompting significantly improved alignment with Korean DWs (ρ = 0.756 vs. 0.715, p = 0.011), and Korean cultural role prompting improved alignment with both GBD 2010 DWs (ρ = 0.922 vs. 0.909, p = 0.002) and Korean DWs (ρ = 0.738 vs. 0.715, p = 0.001). Medical specialist role prompting significantly reduced alignment with GBD 2010 DWs (ρ = 0.895 vs. 0.909, p = 0.001). BLIP demonstrated strong agreement with GBD 2010 DWs (Pearson's r = 0.862, MAE = 0.066) and produced plausible estimates for Long COVID (mild: 0.020, moderate: 0.298, severe: 0.529). 
Conclusions: LLMs can approximate human perceptions of disease severity with high rank-order consistency. Prompt language and role framing significantly influenced alignment, with culturally grounded lay prompting enhancing and specialist prompting reducing correspondence with population-based DWs. BLIP provides a practical framework for generating provisional DW estimates for emerging or underrepresented health states when conventional surveys are infeasible.

10
Long-term within-person variation of routinely measured biomarkers are associated with mortality and cardiovascular health

Webster, A. J.; Drakesmith, C. W.; Perera-Salazar, R.; Steinsaltz, D.; COMPUTE team

2026-05-05 epidemiology 10.64898/2026.05.04.26352236 medRxiv
Top 0.1%
8.3%

Biomarker measurements can assist with disease diagnosis and the assessment of disease risks, with the most recent measurements usually used by disease-risk models. However, a growing number of studies suggest that in addition to a biomarker's value, its inherent variability, estimated from several measurements over many days or years in an individual, can convey independent prognostic information about disease risks. Variance estimates require an individual's biomarker data to have been measured a sufficient number of times, ideally across a long time period, and are usually only available in a hospital setting or clinical trial. Furthermore, a single biomarker measurement will involve a combination of measurement error, natural short-term variation over a daily time period, variation over time periods of weeks and months, and slower age-dependent changes over several years. This paper develops a statistical method that accounts for these latter concerns, and applies it to Clinical Practice Research Datalink (CPRD) data collected by UK General Practitioners. It studies the associations between cardiovascular health outcomes and the within-person variances of eight routinely measured biomarkers. This involved Sequential Monte Carlo modeling to convert an individual's biomarker measurements (collected over months or years) into estimates for the biomarker's mean, linear age-dependent slope, within-person variance, and a variance due to variation on a daily time period or measurement errors. The result is a proof-of-principle that UK primary care Electronic Health Records (from CPRD) can be effectively used for this purpose. After adjusting for mean biomarker values, clear associations were found between mortality or cardiovascular disease risks and within-person variances for 6 of 8 biomarkers.

11
Estimation of hospital catchment populations using data on patient hospital use in France

Shirreff, G.; Chauvel, C.; Casalegno, J.-S.; Vanhems, P.; Dananche, C.; Redjaline, A.; Tazarourte, K.; Nunes, M.

2026-04-29 epidemiology 10.64898/2026.04.28.26351911 medRxiv
Top 0.1%
8.1%

Background: Estimates of disease burden from hospital data require well-informed estimates of the size of the catchment population. Data on patient flows from residential areas to a hospital can be used to estimate detailed catchment populations by age, year and type of hospital visit. Methods: Catchment populations were estimated for hospitals throughout France using a proportional flow approach. Data on hospital use and patient residence were accessed from the Agence Technique de l'Information sur l'Hospitalisation (ATIH). For patients coming from each administrative area, we calculated a preference for each hospital, and combined this with population data for the area to estimate the catchment population of each hospital. For one hospital group, we compared this with data on emergency visits, and data from a retrospective cohort study. Results: Estimated catchment population by hospital group ranged from 4 million per year for Assistance Publique - Hôpitaux de Paris (AP-HP) downwards, with the catchment population strongly reflecting geographic proximity and hospital scale. The type of hospital substantially impacted the size of the catchment area. In the analysis of a single hospital group, the size of the catchment population varied widely with the diagnostic categories associated with the hospital visit. Emergency visits represented a smaller catchment population. The estimated proportional contribution of different departments to the estimated catchment population was similar to their contribution to observed hospital admissions. Incidence rates for a respiratory virus using this catchment population estimation method were consistent with national incidence rates. Conclusions: This study demonstrates the consistency of the proportional flow framework when applied to appropriate data on hospital usage. 
The study provides catchment populations for each hospital in France which can be used for burden estimates such as incidence rates, as well as providing insight into the catchment populations served. Analysis at the department geographic level provided an appropriate balance between detail of analysis and the need to mask data for anonymisation. Further analysis should explore how the size of the catchment area corresponds to the associated travel time to the hospital in question.
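
The proportional flow idea in the Methods can be illustrated with a short sketch. It assumes that an area's "preference" for a hospital is simply that hospital's share of the area's observed admissions, which is one plausible reading of the abstract and not a confirmed detail of the authors' method. The function name and the input layout are hypothetical.

```python
def catchment_populations(admissions, area_population):
    """Allocate each area's residents to hospitals in proportion to observed flows.

    admissions: {area: {hospital: admission_count}}  (assumed input layout)
    area_population: {area: number_of_residents}
    Returns {hospital: estimated_catchment_population}.
    """
    catchment = {}
    for area, by_hospital in admissions.items():
        total = sum(by_hospital.values())
        if total == 0:
            continue  # no observed flows from this area, so nothing to allocate
        for hospital, count in by_hospital.items():
            share = count / total  # the area's "preference" for this hospital
            catchment[hospital] = (
                catchment.get(hospital, 0.0) + share * area_population[area]
            )
    return catchment


# Toy example with made-up numbers: two areas, two hospitals.
example = catchment_populations(
    {"A": {"H1": 30, "H2": 10}, "B": {"H2": 50}},
    {"A": 1000, "B": 2000},
)
```

In this toy case H1 receives three quarters of area A's residents, and H2 receives the remaining quarter of A plus all of B. Stratifying the same calculation by age, year, or visit type would give the more detailed catchments the abstract describes.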

12
FAMES: Federated additive model using piecewise exponential survival data

Islam, N.; Luo, C.; Tong, J.; Weller, G.; Polleya, D. A.; Kent, A.; Bair, S.

2026-05-19 health informatics 10.64898/2026.05.15.26353335 medRxiv
Top 0.1%
7.2%

Introduction: In analyses of time-to-event data, clinical characteristics can have non-linear impacts on survival outcomes, and understanding this dynamic behavior is crucial for producing real-world evidence (RWE). Nonetheless, estimating these dynamic effects is inherently challenging when utilizing real-world data (RWD), especially since sharing individual-level patient data (IPD) is heavily restricted due to regulatory limitations. Additionally, computational difficulties are exacerbated by the high dimensionality, inter-dependency, rarity, sparsity, and scarcity of features. While data augmentation through collaboration across multiple sites might address these challenges, such collaboration is often infeasible and hindered by regulatory measures that protect patient privacy, thereby preventing the sharing of IPD between sites. Objectives: To address this challenge, we propose a privacy-preserving regularized algorithm that eliminates the necessity of aggregating any protected health information across sites. This algorithm employs a penalized federated additive model utilizing piecewise exponential survival (FAMES) data and estimates non-linear effects of features while accounting for non-varying confounding effects. The model is flexible and can accommodate both multiple and multivariate smooth effects simultaneously. Methods: The proposed model transforms survival data into a piecewise exponential data (PED) structure and casts the semi-parametric optimization problem into a generalized additive modeling framework assuming Poisson distribution. The model uses orthonormal splines to approximate non-linear effects and incorporates L2-norm based penalty terms to control the smoothness and goodness-of-fit of these effects. The algorithm is optimized using site-specific aggregated summary statistics and is solved iteratively through the Newton-Raphson method. 
Results: The model is employed to assess the smooth effects of clinical features, such as age and numeric laboratory values, on overall survival using RWD from approximately 874 newly diagnosed Acute Myeloid Leukemia (AML) patients treated at seven distinct sites in the United States. The model exhibited non-linear smooth effects for lactate dehydrogenase, platelets, and others, underscoring their strong association with disease prognosis. The model demonstrates a lossless property, providing estimates of smooth and fixed effects that are comparable to those derived from the pooled PED. Additionally, the inference of parameters for testing the nullity of effects remains consistent. This model is communication-efficient, necessitating roughly twelve rounds of communication across sites. Conclusion: We anticipate that this model can facilitate multisite collaboration and enable smaller sites to participate in generating and validating RWE, especially for rare diseases. While the model was applied within the context of AML, it is disease-agnostic and can be implemented in any other clinical context and across various sites globally without losing any generality.
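
The piecewise exponential data (PED) transformation mentioned in the Methods is a standard device and can be sketched briefly. The version below splits one subject's follow-up at fixed cut points and is only an illustration. The function name, the row layout, and the convention that intervals are closed on the right are assumptions made here, and nothing in it reflects the federated or penalized parts of FAMES.

```python
def to_piecewise_exponential(time, event, cut_points):
    """Split one subject's follow-up into interval rows.

    time: observed follow-up time (> 0)
    event: 1 if the event occurred at `time`, 0 if censored
    cut_points: increasing positive interval boundaries (assumed sorted)
    Each row records the time at risk in one interval and whether the
    event occurred in that interval.
    """
    rows = []
    start = 0.0
    for stop in list(cut_points) + [float("inf")]:
        if time <= start:
            break  # follow-up ended before this interval began
        end = min(time, stop)
        # The event is attributed only to the interval in which follow-up ends.
        event_here = int(bool(event) and time <= stop)
        rows.append(
            {"start": start, "stop": end, "exposure": end - start, "event": event_here}
        )
        start = stop
    return rows


# A subject with an event at t = 5 and cut points at 2 and 4 yields three rows.
ped_rows = to_piecewise_exponential(5.0, 1, [2.0, 4.0])
```

With data in this shape, each row's event indicator can be modeled as Poisson with the log of `exposure` as an offset, which is what allows a survival problem to be fitted in a generalized additive modeling framework.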

13
Combining centralized and decentralized approaches to assess and ensure data quality in Eurocrine(R) via Microsoft Power BI and DataquieR

Musholt, T. J.; Clerici, T.; Bergenfelz, A.; Schmidt, C. O.; Struckmann, S.

2026-06-05 health informatics 10.64898/2026.06.04.26354884 medRxiv
Top 0.1%
7.0%

Background: Medical registries have gained importance in the evaluation of healthcare quality outcomes. In the absence of high-quality evidence, such as randomized controlled trials, studies based on registry data are essential for informing clinical guidelines. Methods for assessing data quality are rarely described in detail. To ensure the credibility of registry-based studies, registries must use all available technical and operational means to guarantee high data quality. Method: Eurocrine(R) is a pan-European endocrine surgical database and quality registry initially funded by the EU healthcare programme, which started in 2015 and now includes more than 200,000 interventions as of April 2025. To ensure high data quality, interactive and standardized reports are created via Microsoft Power BI, which are created both centrally and locally. In addition, comprehensive data quality analyses were performed via the R-based package dataquieR. Results: Although a multitude of technical measures (for example, input screen design and real-time plausibility checks during data entry) are in place, they are not sufficient to prevent human errors at data entry. Errors identified in the reports were corrected, and preventive measures were implemented. Overall, the data quality was assessed as very good in terms of completeness, accuracy, and consistency. Conclusion: It is very important to provide registry users with an efficient and smart tool to identify data issues, as they have the clinical information to correct them. Data quality reports generated with dataquieR represent an effective tool for registry administrators. Predesigned Microsoft Power BI reports enable participating Eurocrine(R) clinics to self-audit their data.

14
Quantifying the Optimism of Naive Cross-Validation for Binary Outcome Prediction with Repeated-Measures Predictors: A Simulation Study and Clinical Illustration

Hagan, J.

2026-05-29 epidemiology 10.64898/2026.05.27.26354222 medRxiv
Top 0.1%
7.0%

Background. Cross-validation (CV) is widely used to estimate predictive performance, but can overestimate performance when applied at the observation level to repeated-measures data. When continuous predictor variables are measured repeatedly within subjects and the binary outcome is defined at the subject level, naive observation-level CV introduces data leakage through within-subject dependence, producing optimistically biased estimates of the area under the receiver operating characteristic curve (AUROC). The magnitude of this bias and the performance of alternative partitioning strategies have not been formally characterized for this data structure. Methods. Three CV strategies were compared for estimating subject-level AUROC in ridge logistic regression models: naive observation-level 10-fold CV, subject-level 10-fold CV, and leave-one-cluster-out (LOCO) CV. The framework was applied to a motivating clinical dataset of daily oxygenation measures and retinopathy of prematurity outcomes among 101 extremely low birth weight infants. A factorial simulation study was conducted across 162 parameter combinations varying cluster count (20-150), intraclass correlation (0.1-0.5), within-cluster autocorrelation (0.2-0.8), and outcome prevalence (10-35%), with 500 simulated datasets per condition (76,389 valid datasets total). Results. In the motivating dataset, naive CV produced optimism of +0.078 AUROC units for severe ROP prediction (15 events, 101 subjects) and +0.031 for any ROP prediction (48 events). Subject-level 10-fold CV closely approximated LOCO (deviation ≤ 0.015). In the simulation, naive CV optimism ranged from +0.039 to +0.204 across all conditions, increasing monotonically with higher ICC, higher autocorrelation, fewer clusters, and lower event rates. Subject-level 10-fold CV was essentially unbiased relative to LOCO across all 162 conditions (mean absolute deviation = 0.002). Conclusions. 
Naive observation-level CV meaningfully overestimates discriminative performance in the repeated-measures binary outcome setting and should not be used. Subject-level CV partitioning effectively eliminates this bias. Accordingly, subject-level partitioning should be considered essential, not optional, when validating prediction models using repeated-measures data with subject-level outcomes.
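
The difference between observation-level and subject-level partitioning can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function names (naive_folds, subject_level_folds, leaking_subjects) and the toy data layout of 20 subjects with 5 rows each are assumptions made for the example.

```python
import random
from collections import defaultdict

def naive_folds(n_rows, n_folds=10):
    """Observation-level folds: rows are assigned without regard to subject.
    (Real use would shuffle rows first; round-robin keeps the sketch deterministic.)"""
    return [i % n_folds for i in range(n_rows)]

def subject_level_folds(subject_ids, n_folds=10, seed=0):
    """Subject-level folds: every row of a subject lands in the same fold."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    fold_of_subject = {s: i % n_folds for i, s in enumerate(subjects)}
    return [fold_of_subject[s] for s in subject_ids]

def leaking_subjects(subject_ids, folds):
    """Subjects whose rows are spread over more than one fold (potential leakage)."""
    seen = defaultdict(set)
    for subject, fold in zip(subject_ids, folds):
        seen[subject].add(fold)
    return {s for s, fs in seen.items() if len(fs) > 1}

# Hypothetical toy data: 20 subjects, 5 repeated measurements each.
subject_ids = [i // 5 for i in range(100)]
naive = naive_folds(len(subject_ids))
grouped = subject_level_folds(subject_ids)
print(len(leaking_subjects(subject_ids, naive)), "subjects leak under naive CV")
print(len(leaking_subjects(subject_ids, grouped)), "subjects leak under subject-level CV")
```

Setting n_folds to the number of subjects in subject_level_folds gives a leave-one-cluster-out partition, which is the reference the abstract compares against.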

15
Mechanism Matters: A Monte Carlo Evaluation of Estimator Validity and Collider Bias in Environmental Mixture Epidemiology

Obeng-Gyasi, E.

2026-05-26 epidemiology 10.64898/2026.05.25.26354044 medRxiv
Top 0.1%
6.5%

Background: Mixture epidemiology deploys sophisticated estimators (Bayesian kernel machine regression with causal mediation analysis [BKMR-CMA], quantile G-computation [QGC], and parametric G-computation) alongside conventional regression. Comparative evaluations have assumed additive, non-mediated data-generating processes, leaving the conditions under which estimator choice determines causal validity uncharacterized. Methods: We developed a simulation framework using military-relevant exposure distributions (metals, per- and polyfluoroalkyl substances [PFAS], polychlorinated biphenyls [PCBs]) and allostatic load (AL) across three deployment tiers, with parameters drawn from military occupational health and contamination literature. Four data-generating processes were specified as directed acyclic graphs: direct effects with confounding (M1), full mediation through AL (M2), synergistic AL-exposure interaction (M3), and collider structure (M4). We evaluated ordinary least squares (OLS), QGC, G-computation, and BKMR-CMA on bias, root mean squared error, and 95% confidence interval coverage across 500 Monte Carlo replications at n = 500 and n = 1,000. Results: No estimator dominated across all mechanisms. Under M1, OLS and G-computation produced near-identical modest positive bias; BKMR-CMA achieved lower root mean squared error through kernel shrinkage. Under M2, BKMR-CMA exhibited severe positive bias for AL (mean bias = +0.579 SD units; coverage = 32.8%). Under M3, BKMR-CMA was the only estimator achieving nominal 95% coverage for AL (95.2%), while regression-based approaches fell to 83.6%. Under M4, G-computation produced persistent bias and near-zero coverage for lead, reflecting structural non-identification. Conclusions: Estimator validity is fundamentally mechanism-dependent. 
Researchers should base estimator choice on explicit causal assumptions about whether AL functions as confounder, mediator, moderator, or collider, particularly in military and occupational cohorts. We provide a mechanism-to-estimator mapping for applied researchers.
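
The three performance metrics named in the abstract (bias, root mean squared error, and 95% confidence interval coverage) are typically computed from the per-replication estimates as in the sketch below. The function name summarize_replications and the Wald-type interval (estimate ± 1.96 × standard error) are assumptions for illustration; the abstract does not say how the intervals were constructed.

```python
import math

def summarize_replications(estimates, std_errors, truth, z=1.96):
    """Summarize Monte Carlo replications of one estimator for one parameter.

    estimates  : point estimate from each simulated dataset
    std_errors : matching model-based standard errors
    truth      : the parameter value used to generate the data
    """
    n = len(estimates)
    errors = [e - truth for e in estimates]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    # Coverage: share of Wald intervals (assumed here) that contain the truth.
    covered = sum(
        1 for e, s in zip(estimates, std_errors)
        if e - z * s <= truth <= e + z * s
    )
    return {"bias": bias, "rmse": rmse, "coverage": covered / n}

# Hypothetical replications, not values from the study.
print(summarize_replications([0.9, 1.1, 1.5], [0.1, 0.1, 0.1], truth=1.0))
```

Coverage far below 0.95, as reported for BKMR-CMA under M2, indicates that the intervals are centered in the wrong place, are too narrow, or both.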

16
Operationalizing Eight-Dimensional Patient-Safety Risk Scoring at Scale: A Multi-Model Large Language Model Reliability Study

Lin, H.-M.; Lyu, J.; Wang, I.-L.

2026-06-01 health informatics 10.64898/2026.05.29.26354437 medRxiv
Top 0.1%
6.3%

Background: Hospital incident risk scoring has long relied on two- or three-dimensional frameworks (Severity Assessment Codes or Risk Priority Numbers), even though root cause analysis standards recognize that clinical risk is multi-factorial. The obstacle has been mainly cognitive: human reviewers cannot reliably score many dimensions across high incident volumes, so richer assessment has not been operationalized at scale. Objective: To extend the traditional three-dimensional FMEA to an eight-dimensional patient-safety risk feature framework, to establish a multi-model large language model (LLM) extraction pipeline that scores these dimensions automatically, and to demonstrate a variance-aware integer optimization (mean-variance integer programming, MV-IP) that provides a reproducible tie-breaking rule for incident prioritization under extraction uncertainty, rather than improved risk coverage. Methods: An 8-dimensional framework covering harm severity, potential harm, frequency, detectability, systemic impact, vulnerable populations, regulatory relevance, and economic impact was applied to 213 synthetic and 196 real curated incident narratives. Three independent LLMs (GPT-5.4, Gemini 3.1 Pro, Grok-4.1 Fast) from different provider families extracted structured risk scores. Inter-model consistency was assessed via ICC(A,1). Among coverage-equivalent selections, MV-IP minimized inter-model variance to give a reproducible prioritization rule. An English-language sensitivity analysis was conducted on 31 AHRQ PSNet WebM&M cases. Results: On real cases, seven of eight dimensions reached Fair or better inter-model reliability (ICC(A,1) 0.53 to 0.83); D5 (Systemic Impact) was the exception at Poor reliability (0.275), driven by little between-case variation rather than by wide model disagreement. 
Reliability was not uniform: two dimensions were Excellent (D1 actual harm 0.834, D8 economic impact 0.782), two Good, and three only Fair, so some dimensions are more readily extractable than others. The same anchors gave broadly similar results on English-language narratives. When deterministic top-K selection returned several equal-coverage solutions (11 on real cases, total inter-model variance 0.205 to 1.274), MV-IP selected the minimum-disagreement set, replacing ad hoc tie-breaking with an explicit rule without improving coverage. Bootstrap resampling found 74% to 90% of per-case variance estimates stable despite the three-model panel. Conclusions: The eight-dimensional framework operationalizes patient-safety risk features that quality teams have considered only implicitly, and three independent LLM families produced reproducible scores on most dimensions of curated narratives. Inter-model agreement, however, measures reproducibility rather than clinical correctness, and high agreement does not by itself establish that a score is right; the dimensions that are reliably extractable today (notably D6 and D8) differ from those that are not yet (D5, and to a lesser degree D4 and D7), which has direct implications for incident-reporting form design. MV-IP contributes a reproducible, variance-aware tie-breaking rule rather than improved coverage. Validation against expert-prioritized RCA lists and deployment on raw institutional incident reports remain the next steps toward clinical use.
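
For readers unfamiliar with ICC(A,1), the sketch below computes the two-way, absolute-agreement, single-rater intraclass correlation from a cases-by-raters score matrix, using the standard mean-squares formula (MSR − MSE) / (MSR + (k − 1)·MSE + (k/n)·(MSC − MSE)). It is a minimal illustration, not the authors' pipeline; the function name icc_a1 and the toy ratings are assumptions.

```python
def icc_a1(ratings):
    """ICC(A,1) for an n-cases x k-raters table of scores (list of equal-length rows)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: cases (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))

# Hypothetical scores: rater 2 is always one point above rater 1.
# Rank order agrees perfectly, but absolute agreement is penalized for the offset.
print(round(icc_a1([[1, 2], [2, 3], [3, 4]]), 3))
```

Because MSR sits in the numerator, a dimension whose scores barely vary between cases yields a low ICC even when models rarely disagree, which is consistent with the explanation the abstract gives for D5.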

17
Accounting for Uncertainty in the Null Benchmark in Two-Stage Phase II Trials

Irlmeier, R.; Jin, Z.; Ye, F.

2026-05-18 epidemiology 10.64898/2026.05.14.26353210 medRxiv
Top 0.1%
6.3%

Background: Simon two-stage designs for binary endpoints and their time-to-event analogues, including the Kwak and Jung method, rely on a fixed null benchmark. Their Type I error control is valid only when that benchmark is correctly specified. In practice, historical benchmarks are often inconsistent due to small samples, population heterogeneity, changing eligibility criteria, and evolving standards of care. Even modest misspecifications can substantially inflate the Type I error rate, leading to costly advancement of ineffective treatments. Methods: We propose the Interval-Null Robust (INR) two-stage design framework that accounts for uncertainty in the historical null benchmark. We define the null hypothesis as a plausible range of clinically uninteresting values: p ∈ [p0L, p0U] for binary endpoints and λ ∈ [λ0L, λ0U] (or equivalent survival probabilities) for time-to-event endpoints. Type I error is controlled uniformly over the full null interval: sup_{θ ∈ Θ0} Pr_θ(Go) ≤ α. Under the monotonicity of the Go probability, the supremum occurs at the least favorable null configuration (p0U and λ0L), but the design is not reduced to a point-null formulation. The interval defines the uncertainty set for error control and is used in selecting among feasible designs through robust criteria such as worst-case regret or minimal average expected sample size. Results: Across representative planning scenarios for both endpoint types, classic designs calibrated to a single benchmark exhibit substantial Type I error inflation when the true null parameter exceeds the assumed planning value. INR designs maintain the nominal Type I error rate across the full null interval, directly addressing this vulnerability to benchmark misspecification. The robustness-efficiency trade-off can be managed through design constraints and robust optimization criteria while preserving uniform Type I error control. 
Conclusions: INR two-stage designs offer a transparent framework for addressing historical control uncertainty in single-arm Phase II trials. By replacing reliance on a fixed benchmark assumption with a more realistic interval of clinically plausible null values, the INR design reduces the risk of false-positive Go decisions caused by benchmark misspecification. INR applies to both binary and time-to-event endpoints and is implemented in the open-source INRDesign R package and accompanying interactive Shiny app.
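
For the binary-endpoint case, the uniform error-control idea can be sketched by evaluating the Go probability of a two-stage design across the whole null interval and taking its maximum. The code below is an illustration under stated assumptions, not the INRDesign package: the function names, the stopping convention (stop after stage 1 if responses ≤ r1, declare Go if total responses > r), and the design tuple and interval limits are all hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_sf(k, n, p):
    """P(X > k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 1.0
    return sum(binom_pmf(j, n, p) for j in range(k + 1, n + 1))

def go_probability(p, r1, n1, r, n):
    """Probability of a Go decision at true response rate p.

    Stage 1: n1 patients, stop for futility if responses <= r1.
    Stage 2: n - n1 further patients, Go if total responses > r.
    """
    n2 = n - n1
    return sum(
        binom_pmf(x1, n1, p) * binom_sf(r - x1, n2, p)
        for x1 in range(r1 + 1, n1 + 1)
    )

def worst_case_type1(p0_low, p0_high, design, grid=201):
    """Largest Go probability over a grid on the null interval [p0_low, p0_high]."""
    step = (p0_high - p0_low) / (grid - 1)
    return max(go_probability(p0_low + i * step, *design) for i in range(grid))

design = (1, 10, 5, 29)  # hypothetical (r1, n1, r, n), chosen only for illustration
print("Go probability at p0L = 0.10:", go_probability(0.10, *design))
print("Worst case over [0.10, 0.15]:", worst_case_type1(0.10, 0.15, design))
```

Because the Go probability increases with p, the grid maximum coincides with the value at the upper limit p0U, which is the least favorable configuration the abstract refers to. A design calibrated only at the lower limit would therefore understate the Type I error if the true null rate were higher.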

18
TrialScout links published results to trial registrations using a large language model

Ahnström, L.; Bruckner, T.; Aspromonti, D. A.; Caquelin, L.; Cummins, J.; DeVito, N. J.; Axfors, C.; Ioannidis, J. P. A.; Nilsonne, G.

2026-03-17 epidemiology 10.64898/2026.03.15.26348383 medRxiv
Top 0.1%
6.2%

Background: Multiple stakeholders need to locate results of registered clinical trials but frequently struggle to find them. Summary results of clinical trials are often not published in trial registries, and publications containing trial results are often not explicitly linked to their respective trial registrations. Finding these results is important to researchers, systematic reviewers, research funders, regulators, clinical practitioners, and patients. Methods: We developed TrialScout, a computer program that uses a large language model to match clinical trials registered on ClinicalTrials.gov with corresponding result publications indexed in PubMed. TrialScout's performance was evaluated through comparison to human-coded matches from previous studies of results reporting rates. Subsequently, TrialScout was applied to a random sample of 9,600 completed or terminated trials. Results: TrialScout had a sensitivity of 92.5% and a specificity of 81.2% compared to human coders. Manual review of 200 cases where TrialScout disagreed with human researchers showed that a majority (123/200, 61.5%, 95% CI, 54.4-68.3%) of disagreements were due to human errors. When used on 9,600 sampled trials in ClinicalTrials.gov, TrialScout found result publications for 6,110 (63.6%) of the trials. Discussion: TrialScout reliably located results of completed clinical trials. The tool offers benefits in terms of speed and efficiency. Estimating TrialScout's accuracy is limited by the lack of a true gold standard. TrialScout can accelerate the process of locating trial results in the scientific literature and can assist in monitoring trial reporting practices.
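
A confidence interval for a proportion such as 123/200 can be computed as sketched below. The abstract does not state which interval method was used; this example assumes the exact (Clopper-Pearson) interval and finds its limits by bisection on the binomial distribution. The function names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

def _root_increasing(g, tol=1e-10):
    """Root of a function g that increases from negative to positive on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for x successes in n trials."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _root_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _root_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper

low, high = clopper_pearson(123, 200)
print(f"123/200 = {123 / 200:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Other methods, such as the Wilson score interval, give slightly narrower limits for the same counts, so small differences from a published interval usually reflect the method chosen rather than an error.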

19
A New Mixed Frequency Regression Model For Environmental Epidemiology

Shukla, N.; Bartington, S. E.; Hansell, A. L.; Lucas, T. C.

2026-06-04 epidemiology 10.64898/2026.06.03.26354801 medRxiv
Top 0.2%
4.9%

Background: In the absence of high-resolution response data, exposure-response modelling often relies on aggregated low-frequency exposure data, leading to loss of high-resolution information. Mixed Data Sampling (MIDAS) from econometrics offers an alternative but is limited by its inability to make high-resolution predictions, its inflexible likelihoods and penalised nonlinear functions, and its limited visualization options. We propose a mixed-frequency Distributed Lag Non-linear Model (mf-DLNM) which can eliminate the need to aggregate exposure data in environmental epidemiology and provide high-resolution predictions for time series studies. Methods: We evaluated the inference and predictive performance of the mf-DLNM. To evaluate its ability to estimate exposure-response relationships, we applied mf-DLNM and same-frequency (sf)-DLNM using data from the West Midlands, UK. Additionally, we compared the predictive performance of mf-DLNM with sf-DLNM and MIDAS across nine regions of England. As MIDAS cannot predict at the resolution of the predictor (daily), we compared the predictive performance of mf-DLNM and MIDAS at weekly resolution. To test the model's ability to predict high temporal resolution risk (daily), we compared sf-DLNM (with access to daily mortality counts) with mf-DLNM (with access only to weekly mortality counts). Results: In the West Midlands example, mf-DLNM performed comparably to sf-DLNM in estimating the daily risk of respiratory mortality associated with temperature. Furthermore, mf-DLNM and MIDAS exhibited similar performance for weekly predictions. For high-resolution predictions, mf-DLNM and sf-DLNM showed similar performance, despite mf-DLNM having access only to low-resolution response data. 
Conclusion: This mixed-frequency approach in environmental epidemiology overcomes the limitations of predicting health risks using aggregated exposure data and provides estimates of high-resolution outcomes in the absence of high-frequency health outcome datasets.

20
Causal estimands and target trials for the effect of lag time to treatment of cancer patients

Goncalves, B. P.; Franco, E. L.

2026-04-08 epidemiology 10.64898/2026.04.07.26350338 medRxiv
Top 0.2%
4.8%

Timeliness of therapy initiation is a fundamental determinant of outcomes for many medical conditions, most importantly, cancer. Yet, existing inefficiencies in healthcare systems mean that delays between diagnosis and treatment frequently adversely affect the clinical outcome for cancer patients. Although estimates of effects of lag time to therapy would be informative to policymakers considering resource allocation to minimize delays in oncology, causal methods are seldom explicitly discussed in epidemiologic analyses of these lag times. Here, we propose causal estimands for such studies, and outline the protocol of a target trial that could be emulated with observational data on lag times. To illustrate the application of this approach, we simulate studies of lag time to treatment under two scenarios: one in which indication bias (Waiting Time Paradox) is present and another in which it is absent. Although our discussion focuses on oncologic outcomes, components of the proposed target trial could be adapted to study delays for other medical conditions. We believe that the clarity with which causal questions are posed under the target trial emulation framework would lead to improved quantification of the effects of lag times in oncology, and hence to better informed policy decisions.
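
The two scenarios described in the abstract can be illustrated with a toy simulation. This is not the authors' simulation: the data-generating model, every coefficient, and the function names are invented for the example. Delay is harmful by construction in both scenarios; the only difference is whether sicker patients are treated sooner.

```python
import random

def simulate(n, indication_bias, seed=0):
    """Simulate (lag_days, died) pairs. All parameter values are hypothetical."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.random()  # unmeasured disease severity in [0, 1)
        if indication_bias:
            # Sicker patients are prioritized and treated sooner.
            lag = max(1.0, 60 - 50 * severity + rng.gauss(0, 5))
        else:
            # Lag is unrelated to severity.
            lag = rng.uniform(10, 60)
        # True model: both severity and delay increase the risk of death.
        risk = 0.05 + 0.5 * severity + 0.002 * lag
        rows.append((lag, rng.random() < risk))
    return rows

def naive_risk_difference(rows):
    """Mortality in the long-lag half minus mortality in the short-lag half."""
    lags = sorted(lag for lag, _ in rows)
    median = lags[len(lags) // 2]
    long_lag = [died for lag, died in rows if lag >= median]
    short_lag = [died for lag, died in rows if lag < median]
    return sum(long_lag) / len(long_lag) - sum(short_lag) / len(short_lag)

for biased in (True, False):
    diff = naive_risk_difference(simulate(20000, indication_bias=biased))
    print(f"indication bias={biased}: naive risk difference = {diff:+.3f}")
```

With indication bias present, the naive comparison makes longer lags look protective even though delay raises risk in the generating model, which is the Waiting Time Paradox. Adjusting for severity, or emulating a target trial with explicit eligibility and treatment strategies, is what separates the effect of delay from the effect of who gets treated quickly.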